This tutorial demonstrates how to analyse audio data for acoustic phonetic studies in R. It is mainly intended to demonstrate possible workflows. The topics covered are:
- Phonetic databases: the case of emuR and EMU-SDMS, the EMU Speech Database Management System
- Sample data: the LOD database
- Queries and requeries
- Inspect the database: serve()
- Calculate duration for vowel categories
- Vowel formants and visualisations with ggplot2
- Vowel explorer
- Calculating the Pillai distance
Preamble
This tutorial is organised as an R Markdown notebook. When you execute code within the notebook, the results appear beneath the code. To do so, you need to have R and RStudio installed. Once this notebook is loaded into RStudio, you can execute a chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter (Ctrl+Shift+Enter on Windows and Linux), which allows you to experiment with the code. This handout contains all output of the code (tables, visualisations etc.).
The following packages are needed and have to be installed beforehand.
library(tidyverse)
── Attaching packages ──────────────────────────────────── tidyverse 1.3.0 ──
✓ ggplot2 3.3.0 ✓ purrr 0.3.3
✓ tibble 3.0.0 ✓ dplyr 0.8.99.9002
✓ tidyr 1.0.2 ✓ stringr 1.4.0
✓ readr 1.3.1 ✓ forcats 0.5.0
Warning: package 'tibble' was built under R version 3.6.2
── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(ggiraph)
library(cowplot)
********************************************************
Note: As of version 1.0.0, cowplot does not change the
default ggplot2 theme anymore. To recover the previous
behavior, execute:
theme_set(theme_cowplot())
********************************************************
library(emuR)
Attaching package: 'emuR'
The following object is masked from 'package:base':
norm
library(tools)
library(rio)
library(ggplot2)
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
library(ggiraph)
library(htmltools)
library(shiny)
library(joeyr, warn.conflicts = FALSE )
This is the "joeyr" package.
library(knitr)
The LOD database
The database contains the audio recordings from the Lëtzebuerger Online Dictionnaire (available here), spoken by one female speaker. The audio files have been automatically segmented with the MAUS tools. We thus have a database consisting of textual data, basically words, and the corresponding audio data. The audio data is segmented into words and phonetic segments (sounds).
This database has been created beforehand. How to create such a database is explained in the EMU-SDMS manual.
We start by loading the database and displaying an overview of its structure and content.
# load database
#db = load_emuDB("/Users/peter.gilles/Documents/_Daten/LOD-emuDB/lod_emuDB")
db = load_emuDB("lod_emuDB")
INFO: Checking if cache needs update for 1 sessions and 539 bundles ...
INFO: Performing precheck and calculating checksums (== MD5 sums) for _annot.json files ...
INFO: Nothing to update!
# display the overview of the structure and content
summary(db)
Name: lod
UUID: c87793c0-012a-11e9-874b-68b599b5deb4
Directory: /Users/peter.gilles/Documents/Data_Science_Humanities/lod_emuDB
Session count: 1
Bundle count: 539
Annotation item count: 10979
Label count: 13547
Link count: 10440
Database configuration:
SSFF track definitions:
name columnName fileExtension
1 dft dft dft
2 praatFms fm praatFms
Level definitions:
name type nrOfAttrDefs attrDefNames
1 bundle ITEM 4 bundle; source; SAM; MAO;
2 ORT ITEM 2 ORT; KAN;
3 MAU SEGMENT 1 MAU;
Link definitions:
type superlevelName sublevelName
1 ONE_TO_MANY bundle ORT
2 ONE_TO_MANY ORT MAU
Tracks in emuR are acoustic representations of the speech signal, here dft for the DFT spectrum (a frequency-domain representation derived from the waveform) and praatFms for the formant measurements of vowels (see below).
Levels in an emuR database represent layers of interlinked linguistic information. bundle is the entire audio file, ORT is the orthographic representation of the audio file, segmented into its individual words. MAU is the segmentation of all ORT items in a bundle into phonetic segments (= sounds).
The hierarchical structure of these levels is expressed in the link definitions as ONE_TO_MANY: one bundle item is linked to many ORT items, and one ORT item to many MAU segments.
Database queries
An emuR database can be queried with a powerful query engine. The first example is a simple query for one word, Aarbecht.
sl = query(db, query = "[ ORT == 'Aarbecht']")
sl
The result is a segment list (sl), containing various information about the found item (time, level, name, database info). The result of the query can also be displayed in the EMU Speech Database Management System.
serve(db, seglist = sl)
The GUI will open in the Viewer pane of RStudio or you can open it in a browser (Chrome preferred).

Here we can also display the hierarchical structure of this database item, which is accessed during queries. bundle is the top level, representing the entire audio file.
The ORT level contains the nodes for the individual words in the bundle, here the two words Aarbecht and Aarbechten. The dependent level is MAU (=Munich Automatic Unit), representing the individual sounds of the words in ORT.

Two aspects make the emuR query system extremely powerful: the use of regular expressions (including negation and other extensions) and combined queries across different levels of the database.
Let’s try more complex queries:
- Regular expression with the operator =~: all words beginning with Aarbecht
sl = query(db, query = "[ ORT =~ 'Aarbecht.*']")
sl
Select the vowel [aː] in all words beginning with Aarbecht. The operator ^ expresses dominance between the two levels, and # marks the level whose items are returned. Note that the label in the segment list is now the vowel and that the start and end times refer only to this sound [aː].
sl = query(db, query = "[ ORT =~ 'Aarbecht.*' ^ #MAU=='aː']")
sl
- Query all words from ORT where the MAU level does NOT contain the segments nmɑaːətd.
sl = query(db, query = "[ #ORT =~'.*' ^ MAU !~ '[nmɑaːətd]' ]")
sl
With a query, the user can compile a data frame from the database, which then forms the subset for the phonetic analysis. We can select, for example, all instances of certain (or all) vowels and specify the preceding or following context.
Of course, querying for individual segments in the audio file, such as words or sounds, is only possible if this information has been added to the database beforehand.
Vowel explorer
knitr::include_app("https://petergill.shinyapps.io/shinyplay/")
Calculate the Pillai distance
---
subtitle: "Lecture for class: Data science in the Humanities"
title: "Big data in the acoustic phonetic analysis"
knit: (function(input_file, encoding) {
  out_dir <- 'docs';
  rmarkdown::render(input_file,
 encoding=encoding,
 output_file=file.path(dirname(input_file), out_dir, 'index.html'))})
 
author: "Peter Gilles"
date: "29. April 2020, 14h00 - 16h00, University of Luxembourg"
output:
  #tufte::tufte_html:
  #tufte::tufte_handout: 
  html_notebook: 
    toc: true
    toc_depth: 2
    number_sections: true
---

This tutorial demonstrates how to analyse audio data for acoustic phonetic studies in R. It is mainly intended to demonstrate possible workflows. The topics covered are:

* Phonetic databases: the case of `emuR` and `EMU-SDMS`, the EMU Speech Database Management System
* Sample data: the LOD database
* Queries and requeries
* Inspect the database: `serve()`
* Calculate duration for vowel categories
* Vowel formants and visualisations with `ggplot2`
* Vowel explorer
* Calculating the Pillai distance

# Preamble {-}

This tutorial is organised as an [R Markdown](http://rmarkdown.rstudio.com) notebook. When you execute code within the notebook, the results appear beneath the code. To do so, you need to have R and RStudio installed. Once this notebook is loaded into RStudio, you can execute a chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Cmd+Shift+Enter* (*Ctrl+Shift+Enter* on Windows and Linux), which allows you to experiment with the code. This handout contains all output of the code (tables, visualisations etc.).

The following packages are needed and have to be installed beforehand.
```{r}
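library(tidyverse)   # data wrangling and plotting (attaches ggplot2, dplyr, tidyr, ...)
library(ggiraph)     # interactive ggplot2 graphics
library(cowplot)     # plot arrangement
library(emuR)        # access to the EMU-SDMS speech database
library(tools)
library(rio)         # data import/export
library(magrittr)    # pipe operators
library(htmltools)
library(shiny)
library(joeyr, warn.conflicts = FALSE)
library(knitr)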
```


# The LOD database

The database contains the audio recordings from the `Lëtzebuerger Online Dictionnaire` (available [here](https://github.com/spellchecker-lu)), spoken by one female speaker. The audio files have been automatically segmented with the [MAUS tools](https://clarin.phonetik.uni-muenchen.de/BASWebServices/interface). We thus have a database consisting of textual data, basically words, and the corresponding audio data. The audio data is segmented into words and phonetic segments (sounds).

This database has been created beforehand. How to create such a database is explained in the [EMU-SDMS manual](https://ips-lmu.github.io/The-EMU-SDMS-Manual/).

We start by loading the database and displaying an overview of its structure and content.
```{r}
# load database
#db = load_emuDB("/Users/peter.gilles/Documents/_Daten/LOD-emuDB/lod_emuDB")
db = load_emuDB("lod_emuDB")

# display the overview of the structure and content
summary(db)

```

Tracks in `emuR` are acoustic representations of the speech signal, here `dft` for the DFT spectrum (a frequency-domain representation derived from the waveform) and `praatFms` for the formant measurements of vowels (see below).

Levels in an `emuR` database represent layers of interlinked linguistic information. `bundle` is the entire audio file, `ORT` is the orthographic representation of the audio file, segmented into its individual words. `MAU` is the segmentation of all `ORT` items in a `bundle` into phonetic segments (= sounds).

The hierarchical structure of these levels is expressed in the `link definitions` as `ONE_TO_MANY`: one `bundle` item is linked to many `ORT` items, and one `ORT` item to many `MAU` segments.
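
The individual parts of this configuration can also be retrieved as data frames. The following is a minimal sketch, assuming the demo database `ae` that ships with `emuR` as a stand-in for the LOD database; `demo_dir` and the `*_df` names are only illustrative, and the level and track names of `ae` differ from those of the LOD database.

```{r}
library(emuR)

# Assumption: the emuR demo data are used here instead of the LOD database
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

levels_df = list_levelDefinitions(ae)      # one row per annotation level
links_df  = list_linkDefinitions(ae)       # pairs of super- and sublevels
tracks_df = list_ssffTrackDefinitions(ae)  # derived signal tracks

levels_df
links_df
tracks_df
```

With the LOD database loaded above, the same three calls would take `db` as their argument.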



# Database queries

An emuR database can be queried with a powerful query engine. The first example is a simple query for one word, `Aarbecht`.

```{r}
sl = query(db, query = "[ ORT == 'Aarbecht']")
sl
```

The result is a `segment list` (`sl`), containing various information about the found item (time, level, name, database info). The result of the query can also be displayed in the EMU Speech Database Management System. 

`serve(db, seglist = sl)`

The GUI will open in the `Viewer` pane of RStudio or you can open it in a browser (Chrome preferred).

```{r echo=FALSE}
knitr::include_graphics("emu-sdms.png")
```

Here we can also display the hierarchical structure of this database item, which is accessed during queries. `bundle` is the top level, representing the entire audio file.

The `ORT` level contains the nodes for the individual words in the `bundle`, here the two words `Aarbecht` and `Aarbechten`. The dependent level is `MAU` (=`Munich Automatic Unit`), representing the individual sounds of the words in `ORT`.

```{r echo=FALSE}
knitr::include_graphics("hierarchy.png")
```
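
The hierarchy can also be traversed from R by means of a requery: starting from a segment list on one level, `requery_hier()` returns the linked items on another level. The sketch below again assumes the `emuR` demo database `ae` as a stand-in, where the word labels live in the attribute `Text` and the sounds on the level `Phoneme`; the names `sl_word` and `sl_phon` are illustrative.

```{r}
library(emuR)

# Assumption: the emuR demo data are used here instead of the LOD database
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# all occurrences of the word "beautiful"
sl_word = query(ae, query = "[Text == beautiful]")

# downward requery: the phoneme sequence dominated by each of these words
sl_phon = requery_hier(ae, seglist = sl_word, level = "Phoneme")
sl_phon
```

For the LOD database, the analogous call would presumably be `requery_hier(db, seglist = sl, level = "MAU")`, which moves from the words in `ORT` down to their sounds.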

Two aspects make the `emuR` query system extremely powerful: the use of regular expressions (including negation and other extensions) and combined queries across different levels of the database.

Let's try more complex queries:

- Regular expression with the operator `=~`: all words beginning with `Aarbecht`
```{r}
sl = query(db, query = "[ ORT =~ 'Aarbecht.*']")
sl
```

Select the vowel [aː] in all words beginning with `Aarbecht`. The operator `^` expresses dominance between the two levels, and `#` marks the level whose items are returned. Note that the label in the segment list is now the vowel and that the start and end times refer only to this sound [aː].

```{r}
sl = query(db, query = "[ ORT =~ 'Aarbecht.*' ^ #MAU=='aː']")
sl

```
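
Because a segment list stores the start and end time of every segment (in milliseconds), durations per vowel category can be derived directly from it. The following sketch uses a small made-up segment list, `sl_toy`, with invented labels and times; a real one would come from `query()` and contain further columns.

```{r}
library(dplyr)

# Hypothetical segment list with invented values (times in ms)
sl_toy = tibble(
  labels = c("aː", "aː", "ɑ", "ɑ", "ə"),
  start  = c(120, 910, 300, 1500, 640),
  end    = c(260, 1030, 370, 1580, 690)
)

# duration of each segment, then count and mean duration per vowel label
durations = sl_toy %>%
  mutate(duration = end - start) %>%
  group_by(labels) %>%
  summarise(n = n(), mean_duration = mean(duration))

durations
```

The same `mutate()` and `summarise()` steps should carry over to a segment list returned by `query()`, since it uses the same column names `labels`, `start` and `end`.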

- query all words from `ORT` where the `MAU` level does NOT contain the segments `nmɑaːətd`.

```{r}
sl = query(db, query = "[ #ORT =~'.*' ^ MAU !~ '[nmɑaːətd]' ]")
sl

```

With a query, the user can compile a data frame from the database, which then forms the subset for the phonetic analysis. We can select, for example, all instances of certain (or all) vowels and specify the preceding or following context.

Of course, querying for individual segments in the audio file, such as words or sounds, is only possible if this information has been added to the database beforehand.
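
Once a set of vowels has been selected, the formant values stored in a track can be extracted for exactly these segments and plotted. This is a rough sketch, assuming the `emuR` demo database `ae` with its formant track `fm` (the LOD database uses `praatFms` instead) and an illustrative set of vowel labels; `sl_vowels` and `td` are hypothetical names.

```{r}
library(emuR)
library(ggplot2)

# Assumption: the emuR demo data are used here instead of the LOD database
demo_dir = file.path(tempdir(), "emuR_demoData")
if (!dir.exists(demo_dir)) create_emuRdemoData(dir = tempdir())
ae = load_emuDB(file.path(demo_dir, "ae_emuDB"), verbose = FALSE)

# segments on the Phonetic level labelled with one of these vowel symbols
sl_vowels = query(ae, query = "[Phonetic =~ '[AEIOUV]']")

# formant values at the temporal midpoint (cut = 0.5) of each segment;
# the columns T1 and T2 hold the first and second formant
td = get_trackdata(ae, seglist = sl_vowels, ssffTrackName = "fm", cut = 0.5)

# F1/F2 vowel chart with reversed axes, as is customary in phonetics
ggplot(td, aes(x = T2, y = T1, label = labels)) +
  geom_text() +
  scale_x_reverse() +
  scale_y_reverse() +
  labs(x = "F2 (Hz)", y = "F1 (Hz)")
```

Taking a single value at the midpoint keeps the example small; omitting `cut` would return the whole formant trajectory of each segment instead.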


# Vowel explorer
```{r}
knitr::include_app("https://petergill.shinyapps.io/shinyplay/")
```

# Calculate the Pillai distance

The following illustration is taken from Joey Stanley's [tutorial on calculating vowel overlap](https://joeystanley.com/blog/a-tutorial-in-calculating-vowel-overlap).

```{r echo=FALSE}
knitr::include_graphics("https://joeystanley.com/images/plots/overlap_tutorial/pillai_example.png")
```
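
The Pillai score is the Pillai-Bartlett trace of a MANOVA with F1 and F2 as dependent variables and the vowel category as predictor; values near 0 indicate strong overlap of the two categories, values near 1 clear separation. The following is a minimal sketch in base R with simulated formant values; the data frame `vowels` and its category names `A` and `B` are invented for illustration and do not come from the LOD database.

```{r}
set.seed(1)

# Hypothetical F1/F2 measurements (Hz) for two vowel categories
vowels = data.frame(
  vowel = rep(c("A", "B"), each = 50),
  F1 = c(rnorm(50, mean = 700, sd = 40), rnorm(50, mean = 650, sd = 40)),
  F2 = c(rnorm(50, mean = 1200, sd = 80), rnorm(50, mean = 1350, sd = 80))
)

# MANOVA with both formants as joint response
fit = manova(cbind(F1, F2) ~ vowel, data = vowels)

# extract the Pillai trace for the vowel term from the summary table
pillai = summary(fit, test = "Pillai")$stats["vowel", "Pillai"]
pillai
```

With real data, `vowels` would be replaced by a data frame of measured formants, for instance one row per vowel token with its midpoint F1 and F2.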



